Fix Norm Layers, Again #1509

Closed · wants to merge 49 commits

Conversation

@DhairyaLGandhi (Member) commented Feb 16, 2021

This is a successor to #1495

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

Todo:

  • Go through the tests: we don't want to use the autodiff test file as it is; doing more straightforward checks is better.
  • Remove hasaffine
  • remove testmode! - use zygote instead and have an equivalent forward pass function available a la dropout
  • determine forward pass at compile time

@CarloLucibello (Member) commented Feb 17, 2021

remove testmode! - use zygote instead and have an equivalent forward pass function available a la dropout

this TODO has nothing to do with this PR.

@darsnack (Member)

And the last time it was discussed, there was not a complete consensus. I think it is better not to put this TODO on the v0.12 path.

@DhairyaLGandhi (Member, Author)

Still not perfect, but better than master. We would need to figure out removing the unions as well.

This:

julia> bn = BatchNorm(10)
BatchNorm(10, identity, affine = )

julia> x = rand(Float32, 3,3,3,10);

julia> @code_warntype bn(x)
Variables
  BN::BatchNorm{typeof(identity),Array{Float32,1},Float32,Array{Float32,1}}
  x::Array{Float32,4}
  #334::Flux.var"#334#335"{Array{Float32,4},Int64}
  @_4::Int64
  N::Int64
  reduce_dims::Array{Int64,1}
  affine_shape::Union{Nothing, NTuple{4,Int64}}
  ts::Union{Nothing, Bool}
  μ::Array{Float32,4}
  σ²::Array{Float32,4}
  @_11::Union{Nothing, NTuple{4,Int64}}
  @_12::Union{Nothing, Bool}

Body::Array{Float32,4}
1 ──       Core.NewvarNode(:(#334))
│          Core.NewvarNode(:(@_4))
│          Core.NewvarNode(:(reduce_dims))
│          Core.NewvarNode(:(affine_shape))
│          Core.NewvarNode(:(ts))
│          Core.NewvarNode(:(μ))
│          Core.NewvarNode(:(σ²))
│          (N = Flux.ndims(x))
[...]

Master:

julia> bn = BatchNorm(10)
BatchNorm(10)

julia> x = rand(Float32, 3,3,10,10);

julia> bn(x);

julia> @code_warntype bn(x)
Variables
  BN::BatchNorm{typeof(identity),Array{Float32,1},Float32,Array{Float32,1}}
  x::Array{Float32,4}
  #239::Flux.var"#239#240"{Array{Float32,4},Int64}
  N::Int64
  reduce_dims::Array{Int64,1}
  affine_shape::NTuple{4,Int64}

Body::Any
1 ─       Core.NewvarNode(:(#239))
│         Core.NewvarNode(:(N))
│         Core.NewvarNode(:(reduce_dims))
│         Core.NewvarNode(:(affine_shape))
│   %5  = Flux.ndims(x)::Core.Compiler.Const(4, false)
│   %6  = (%5 - 1)::Core.Compiler.Const(3, false)
│   %7  = Flux.size(x, %6)::Int64
│   %8  = Base.getproperty(BN, :chs)::Int64
│   %9  = (%7 == %8)::Bool

@darsnack (Member) commented Mar 3, 2021

Based on our discussion yesterday, this PR looks ready except the plans for active. Can we agree to save those for another release, since master currently makes no changes to the active/trainmode!/testmode! API?

Other than that looks like we only need tests and docs.

Comment on lines +287 to +288
β = initβ(chs)
γ = initγ(chs)
@CarloLucibello (Member) commented Mar 5, 2021

Why should we have params that we don't use in practice when affine=false?

Member

Other thought on this: I know we rejected Affine(BatchNorm(...)) because it is not ergonomically nice. But what about having a field in BatchNorm called affine that is a function? When the affine kwarg is true, then affine = Dense(...), but when it is false, affine = identity. In a Chain, the user still sees just BatchNorm, and bn.affine will show a Dense, which the user will intuitively understand. We also don't end up storing data we don't need, and trainable will automatically be empty.
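
A rough sketch of this idea (editor's illustration; SketchBatchNorm and its simplified fields are hypothetical, not this PR's implementation): the affine field holds a callable, so no unused β/γ are stored and params is automatically empty when affine=false.

using Flux

struct SketchBatchNorm{F,A,V}
    λ::F          # activation function
    affine::A     # Flux.Diagonal(chs) when affine=true, identity when affine=false
    μ::V          # running mean (not trainable)
    σ²::V         # running variance (not trainable)
end

Flux.@functor SketchBatchNorm
Flux.trainable(l::SketchBatchNorm) = (l.affine,)   # `identity` contributes no params

SketchBatchNorm(chs::Integer; affine = true, λ = identity) =
    SketchBatchNorm(λ, affine ? Flux.Diagonal(chs) : identity,
                    zeros(Float32, chs), ones(Float32, chs))

length(Flux.params(SketchBatchNorm(2)))                  # 2 (the Diagonal's scale and shift)
length(Flux.params(SketchBatchNorm(2, affine = false)))  # 0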

Member

Kind of like you'd have for an activation function? That does sound pretty appealing; would it make sense to call that field transform or something, then?

Member

Yeah, exactly. transform would certainly capture the field accurately, but affine might be better here because people will understand it.

Member Author

Well, that had been a consideration. It's a bit awkward to also do γ = reshape(l.γ, affine_shape), but I think we can have broadcasting take care of the shape as needed by l.λ.(γ .* x̂ .+ β).
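
For concreteness, a tiny self-contained example (editor's illustration; shapes chosen arbitrarily) of the reshape-then-broadcast pattern in the quoted expression l.λ.(γ .* x̂ .+ β):

x̂ = randn(Float32, 5, 5, 3, 2)                # W × H × C × N, already normalised
γ = ones(Float32, 3); β = zeros(Float32, 3)   # per-channel affine parameters
affine_shape = (1, 1, 3, 1)                   # put the channel dim in position 3
λ = identity

y = λ.(reshape(γ, affine_shape) .* x̂ .+ reshape(β, affine_shape))
size(y)                                       # (5, 5, 3, 2): broadcasting fills in the rest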

affine, track_stats, nothing)
end

@functor BatchNorm
trainable(bn::BatchNorm) = hasaffine(bn) ? (bn.β, bn.γ) : ()
# trainable(bn::BatchNorm) = hasaffine(bn) ? (bn.β, bn.γ) : ()
Member

If you remove this, μ and σ² will end up in the params.
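
To see why (editor's illustration with a hypothetical MiniNorm type, not Flux's BatchNorm): @functor exposes every array field to params, so without a trainable override the running statistics get collected as well.

using Flux

struct MiniNorm
    β; γ; μ; σ²
end
Flux.@functor MiniNorm

m = MiniNorm(zeros(Float32, 2), ones(Float32, 2), zeros(Float32, 2), ones(Float32, 2))
length(Flux.params(m))                      # 4: μ and σ² are picked up too

Flux.trainable(m::MiniNorm) = (m.β, m.γ)    # restrict collection to the affine parameters
length(Flux.params(m))                      # 2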


pixelshuffle = [PixelShuffle]
gpu_gradtest("PixelShuffle 2d", pixelshuffle, rand(Float32, 3, 4, 18, 3), 3)
gpu_gradtest("PixelShuffle 1d", pixelshuffle, rand(Float32, 3, 18, 3), 3)
Member

there is a weird one space indent here

Member

I think the indent is still here?

@CarloLucibello (Member)

I'll recap what this PR, when ready, should not change with respect to master:

  • the constructor interface BatchNorm(3, affine=true, track_stats=false) and its default values
  • the behaviour in the forward pass
  • the layers' parameters:
julia> params(BatchNorm(2))
Params([Float32[0.0, 0.0], Float32[1.0, 1.0]])

julia> params(BatchNorm(2, affine=false))
Params([])

So, since there won't be any API changes with respect to master (unless I'm missing something), we can consider this non-blocking for the v0.12 release.

Co-authored-by: Carlo Lucibello <[email protected]>
@DhairyaLGandhi (Member, Author)

How does one change the behaviour of a layer at different points of training? Say I don't want to perform affine transforms initially, but want to later. This params change would hinder that: it would require extra boilerplate code on the user's end and potentially extra copies, slowing down the entire process unnecessarily. This breaks the contract of eager execution.

How do I also restart training from a given point if I don't serialise the state of the tracking variables at that time? They would not be trained either way, so the user's choice is respected.

@CarloLucibello (Member)

How does one change the behaviour of a layer at different points of training? Say I don't want to perform affine transforms initially, but want to later. This params change would hinder that: it would require extra boilerplate code on the user's end and potentially extra copies, slowing down the entire process unnecessarily. This breaks the contract of eager execution.

It is not clear what you mean in the last two sentences, but the point is that the params function should return a meaningful result, that is, whatever is meant to be a layer's trainable parameters. If you declare m = BatchNorm(2, affine=false), then params(m) should be empty; anything else would be surprising and would make params meaningless.

The use case you mention can simply be achieved by creating the layer with affine=true and then freezing the parameters during whatever part of training you want, the same way you would do it with any set of parameters.
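
A sketch of that freezing workflow with the implicit-parameter API (Flux.params plus delete!, roughly as the Flux docs describe for freezing layer parameters):

using Flux

m  = BatchNorm(2)                       # affine=true by default, so β and γ exist
ps = Flux.params(m)                     # Params([β, γ])
delete!(ps, m.β); delete!(ps, m.γ)      # train with `ps` while the affine part stays frozen

# later, to start training the affine transform, simply re-collect:
ps = Flux.params(m)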

How do I also restart training from a given point if I don't serialise the state of the tracking variables at that time? They would not be trained either way, so the user's choice is respected.

Again, params(m) should return the trainable parameters, not everything in a module. Serialization has nothing to do with params.

PyTorch handles the separation between trainable parameters and the rest of the internal state very neatly:

In [15]: import torch.nn as nn

In [16]: m = nn.BatchNorm2d(2, affine=True)

In [17]: list(m.parameters())
Out[17]: 
[Parameter containing:
 tensor([1., 1.], requires_grad=True),
 Parameter containing:
 tensor([0., 0.], requires_grad=True)]

In [18]: m = nn.BatchNorm2d(2, affine=False)

In [19]: list(m.parameters())
Out[19]: []

In [20]: m = nn.BatchNorm2d(2, track_running_stats=True)

In [21]: list(m.buffers())
Out[21]: [tensor([0., 0.]), tensor([1., 1.]), tensor(0)]

In [22]: m = nn.BatchNorm2d(2, track_running_stats=False)

In [23]: list(m.buffers())
Out[23]: []

@darsnack (Member) left a comment

There are still some issues related to the changes to the tests which aren't clear.

testmode!(m, true)
y = evalwgrad(m, x) # should override istraining
@test count(a->a==0, y) == 0
y = m(x) # should override istraining
Member

This still holds

y = m(x)
@test isapprox(y, sigmoid.((x .- m.μ) ./ sqrt.(m.σ² .+ m.ϵ)), atol = 1.0e-7)
@test_broken isapprox(y, mean(sigmoid.((x .- m.μ) ./ sqrt.(m.σ² .+ m.ϵ)), dims = 1), atol = 1.0e-7)
Member

Bump here

Comment on lines +236 to +238
# @test length(Flux.params(InstanceNorm(10))) == 0
# @test length(Flux.params(InstanceNorm(10, affine = true))) == 2
# @test length(Flux.params(InstanceNorm(10, affine =false))) == 0
Member

Should be changed back

Member Author

We aren't guaranteeing this.

Member

A test isn't a guarantee of behavior. But this is the current behavior, so it should be tested. If, two weeks from now, someone borks trainable, then this test would accurately identify the issue, which is why it exists.

Comment on lines +190 to +191
μnew, σ²new = track_stats(x, (l.μ, l.σ²), (μ,σ²),
l.momentum, reduce_dims = nc.dims)
Member

these can go in the zygote block below

Member

I think this is okay since track_stats has @nograd

Member

in which case the @nograd on track_stats can be removed

end

let m = trainmode!(BatchNorm(2)), x = reshape(Float32.(1:6), 3, 2, 1)
Member

these tests shouldn't be changed

Comment on lines -123 to +126
let m = BatchNorm(32), x = randn(Float32, 416, 416, 32, 1);
m(x)
@test (@allocated m(x)) < 100_000_000
end
# let m = BatchNorm(32), x = randn(Float32, 416, 416, 32, 1);
# m(x)
# @test (@allocated m(x)) < 100_000_000
# end
Member

Probably this was here for a reason; could try to check with git blame when it was introduced.

Member

Blame leads back to #586. Personally I think it's a little weird to just test allocations for BatchNorm, but on the other hand this is 4 lines that run quickly and don't add any real maintenance overhead. My vote is to either uncomment it or get rid of it entirely. Leaving stuff around commented out without a little explainer raises more questions than it answers :P

y = m(x)
μ = mean(x, dims=1)
σ² = var(x, dims=1, corrected=false)
y, back = pullback((m,x) -> m(x), m, x)
Member

When a layer's internals change, the previous tests shouldn't be modified, so that one can be sure that previous behaviour is preserved. Normalization layers behave differently in gradient contexts; this test was testing the behaviour in a non-gradient context.

@CarloLucibello (Member)

There are too many changes to the tests; they should mostly be reverted, otherwise it's hard to check that the revised internals don't introduce new (or old) bugs.

Comment on lines +193 to 188
Zygote.ignore() do
l.μ = reshape(μnew, :)
l.σ² = reshape(σ²new, :)
end
Member

Why not put this in track_stats?
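
A structural sketch of that suggestion (editor's illustration; update_stats! is a hypothetical helper, and the real track_stats in this PR has a different signature and also applies the momentum update): the mutation and the Zygote.ignore block live together in the helper instead of at the call site.

import Zygote

function update_stats!(l, μnew, σ²new)
    Zygote.ignore() do
        l.μ  = reshape(μnew, :)     # running statistics are plain state,
        l.σ² = reshape(σ²new, :)    # so keep their update out of the gradient
    end
    return nothing
end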

sz = sz isa Integer ? (sz,) : sz
diag = affine ? Diagonal(sz...) : nothing
return LayerNorm(λ, diag, ϵ, sz, affine)
function LayerNorm(sz, λ = identity; affine = Diagonal(sz...), ϵ = 1f-5)
Member

This isn't quite what I was suggesting. I didn't mean changing the affine keyword to be non-boolean; LayerNorm was fine as is. I meant storing affine as a function in BatchNorm, etc., similar to LayerNorm. The API would still be a boolean.

Member Author

I get that, but going forward we would likely not want to limit kwargs to booleans: while that is symmetric with other libraries, it limits what can be expressed. Specifically, a boolean can express only two states, while an anonymous function can do more than that. Since our implementation isn't limiting what can be done with the layer, we should pass those freedoms on to the end user as well. This means we can have a compatibility layer to support booleans, but when it isn't worth writing one's own layer, or when tweaking settings that aren't directly expressible with the boolean, it's handy to pass in an anonymous function instead.

We wouldn't want the boolean to be the only API users can interact with. In fact, overuse of it can add bloat to the code base.

Member

I agree completely, but the default constructor provides that flexibility? I also agree on the overuse of booleans, but for the norm layers 99.99% of users want an affine transformation (not some other function) and they just want to turn it on or off. Here, a boolean makes sense for the sake of convenience. We can still allow the advanced use case with the default constructor.

Member Author

The default constructor is much less ergonomic and harder to use. It also requires the user to know so many details about Flux's internals that it's untenable and a poor use of time, and it makes user code fragile to updates to Flux.

The fact that other libraries push boolean kwargs is why I proposed having some kind of hybrid constructor that can accept either.

Member

Sure, a hybrid constructor is fine by me
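
A sketch of what such a hybrid constructor could look like (editor's illustration; make_affine and HybridLayerNorm are hypothetical names, and the field order follows the LayerNorm(λ, diag, ϵ, sz, affine) call visible in the diff above):

using Flux

# Accept either a Bool (the familiar affine=true/false API) or a callable,
# and normalise both to a function before storing it in the layer.
make_affine(affine::Bool, sz) = affine ? Flux.Diagonal(sz...) : identity
make_affine(affine, sz)       = affine               # user-supplied transform, used as-is

function HybridLayerNorm(sz, λ = identity; affine = true, ϵ = 1f-5)
    sz   = sz isa Integer ? (sz,) : sz
    diag = make_affine(affine, sz)
    return LayerNorm(λ, diag, ϵ, sz, diag !== identity)
end

HybridLayerNorm(8)                        # boolean API, affine scale-and-shift
HybridLayerNorm(8, affine = false)        # boolean API, no affine transform
HybridLayerNorm(8, affine = Dense(8, 8))  # advanced use: any callable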

@DhairyaLGandhi (Member, Author)

bumping

@ToucheSir (Member) left a comment

Seems like there are still a handful of comments here, so resolving those should clear up any blockers.

diag = affine ? Diagonal(sz...) : nothing
return LayerNorm(λ, diag, ϵ, sz, affine)
function LayerNorm(sz, λ = identity; affine = Diagonal(sz...), ϵ = 1f-5)
# diag = affine ? Diagonal(sz...) : identity
@ToucheSir (Member) commented Aug 12, 2021

Per the discussion above (hybrid constructor), I believe this line needs to be brought back.

Comment on lines +166 to +181
struct NormConfig{A,T}
dims::Vector{Int}
end

NormConfig(affine, track_stats, reduce_dims) = NormConfig{affine, track_stats}(reduce_dims)

getaffine(nc::NormConfig{true}, sz_x; dims = length(sz_x) - 1) =
ntuple(i -> i in dims ? sz_x[i] : 1, length(sz_x))

getaffine(nc::NormConfig{false}, args...; kwargs...) = ()

# For InstanceNorm, GroupNorm, and BatchNorm.
# Compute the statistics on the slices specified by reduce_dims.
# reduce_dims=[1,...,N-2,N] for BatchNorm
# reduce_dims=[1,...,N-2] for InstanceNorm and GroupNorm
function _norm_layer_forward(l, x::AbstractArray{T,N}; reduce_dims, affine_shape) where {T, N}
if !_isactive(l) && l.track_stats # testmode with tracked stats
function norm_forward(l, x::AbstractArray{T,N}, nc::NormConfig{A, true}) where {A, T, N}
Member

Now that all this functionality has been pulled out, we should consider moving it into NNlib next. Getting rid of https://github.com/FluxML/Flux.jl/tree/master/src/cuda (which only contains a batchnorm override) would be excellent and save us quite a few headaches.
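
As a usage sketch of the design quoted above (editor's illustration, assuming the NormConfig and getaffine definitions from the diff are in scope): lifting affine and track_stats into type parameters lets the forward pass be chosen by dispatch at compile time, which is what the "determine forward pass at compile time" checklist item is after.

x  = rand(Float32, 4, 4, 3, 2)            # W × H × C × N
nc = NormConfig(true, false, [1, 2, 4])   # NormConfig{true,false}, BatchNorm-style reduce_dims

getaffine(nc, size(x))                                    # (1, 1, 3, 1): reshape target for β/γ
getaffine(NormConfig(false, false, [1, 2, 4]), size(x))   # (): the affine=false method, no runtime branch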

@CarloLucibello (Member)

closing as stale

@chengchingwen chengchingwen deleted the dg/instance2 branch February 5, 2023 17:55
@chengchingwen chengchingwen restored the dg/instance2 branch February 5, 2023 18:04
@CarloLucibello CarloLucibello deleted the dg/instance2 branch March 22, 2023 09:12